145 min approx
Use the pivot_*, separate, and unite functions from the tidyr package in the Tidyverse to reshape data into tidy form.
Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst
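There are three interrelated rules that make a dataset tidy:
Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.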
Example: the tidyr::billboard dataset.
tidyr::pivot_longer
Important
tidyr::pivot_longer converts your data to a “longer” format.
cols: select which variables should be pivoted.
names_to: define the column hosting the names of the cols columns.
values_to: define the column hosting the values of the cols columns.
Warning
Many, possibly uninformative, missing values!
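As a minimal sketch of the three arguments above, the following uses a small made-up table (the `songs` data and its column names are hypothetical, chosen only to resemble weekly rankings). Note how the missing cells of the wide table become explicit rows in the long one.

```r
library(tidyr)

# Hypothetical wide table: one row per track, one column per week.
songs <- tibble::tibble(
  track = c("song_a", "song_b"),
  wk1   = c(10, 25),
  wk2   = c(8, NA),
  wk3   = c(7, NA)
)

songs_long <- songs |>
  pivot_longer(
    cols      = c(wk1, wk2, wk3), # the columns to pivot
    names_to  = "week",           # new column hosting the old column names
    values_to = "rank"            # new column hosting the old cell values
  )

# One row per track-week pair, including the pairs with a missing rank.
songs_long
```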
values_drop_na: decide whether rows with missing information in the values column should be removed.
Selection helpers available for cols:
var1:var10: variables lying between var1 on the left and var10 on the right.
starts_with("a"): names that start with “a”.
ends_with("z"): names that end with “z”.
contains("b"): names that contain “b”.
matches("x.y"): names that match the regular expression x.y.
num_range("x", 1:4): names following the pattern x1, x2, …, x4.
all_of(vars)/any_of(vars): names stored in the character vector vars. all_of(vars) will error if any of the variables aren’t present; any_of(vars) will match just the variables that exist.
everything(): all variables.
last_col(): furthest column on the right.
where(is.numeric): all variables where is.numeric() returns TRUE.
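A possible way to combine a selection helper with values_drop_na is sketched below. The `songs` table is hypothetical; the point is only that starts_with("wk") replaces the explicit list of columns, and that rows with a missing value are dropped.

```r
library(tidyr)

# Hypothetical wide table: one row per track, one column per week.
songs <- tibble::tibble(
  track = c("song_a", "song_b"),
  wk1   = c(10, 25),
  wk2   = c(8, NA),
  wk3   = c(7, NA)
)

songs_complete <- songs |>
  pivot_longer(
    cols           = starts_with("wk"), # every column whose name starts with "wk"
    names_to       = "week",
    values_to      = "rank",
    values_drop_na = TRUE               # drop rows whose rank is missing
  )

# Only the track-week pairs with an observed rank remain.
songs_complete
```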
Tip
!selection: only variables that don’t match selection.
selection1 & selection2: only variables included in both selection1 and selection2.
selection1 | selection2: all variables that match either selection1 or selection2
Tip
If each column name encodes multiple variables, you can pivot the columns while maintaining the underlying structure. This way, you can separate them in a further, second step using tidyr::separate.
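The two-step approach described in this tip could look like the sketch below. The `bp` table and its column names (a measure and a time point joined by an underscore) are assumptions made up for illustration.

```r
library(tidyr)

# Hypothetical wide table: each column name packs a measure and a time point.
bp <- tibble::tibble(
  id       = 1:2,
  sbp_pre  = c(120, 135),
  sbp_post = c(115, 128),
  dbp_pre  = c(80, 90),
  dbp_post = c(78, 85)
)

# Step 1: pivot, keeping the combined names in a single column.
bp_long <- bp |>
  pivot_longer(
    cols      = !id,
    names_to  = "measure_time",
    values_to = "value"
  )

# Step 2: split the combined names into two separate variables.
bp_tidy <- bp_long |>
  separate(measure_time, into = c("measure", "time"), sep = "_")

bp_tidy
```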
Your turn
Connect to our pad (https://bit.ly/ubep-rws-pad)
Connect to the Day-3 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio)
…and:
… pivot_longer?
names_from
names_to
values_from
values_to
Open 10-pivot_longer.R and follow the instructions step by step.
01:00
Important
To transform a table into a longer one, you need to put some of its column names_to a new column, and their corresponding values_to another one, possibly dropping missing values with values_drop_na.
tidyr::pivot_wider
Image from Data Carpentry’s R for Social Scientists
Your turn
Connect to our pad (https://bit.ly/ubep-rws-pad)
Connect to the Day-3 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio)
…and:
… pivot_wider?
names_from
names_to
values_from
values_to
Open 11-pivot_wider.R and follow the instructions step by step.
01:00
Important
To transform a table into a wider one, you need to take the new column names_from an existing column, and their corresponding values_from the associated one, possibly filling the newly created missing values with values_fill.
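As an illustration of the summary above, here is a minimal sketch on a made-up `visits` table (hypothetical data). The fill value 0 is an arbitrary choice for the example; whether filling is appropriate depends on what a missing combination means in your data.

```r
library(tidyr)

# Hypothetical long table: subject 2 has no follow-up visit.
visits <- tibble::tibble(
  id    = c(1, 1, 2),
  visit = c("baseline", "followup", "baseline"),
  score = c(10, 12, 9)
)

# New column names come from `visit`, their values from `score`.
visits_wide <- visits |>
  pivot_wider(names_from = visit, values_from = score)

# Same reshaping, but the missing combination is filled explicitly.
visits_filled <- visits |>
  pivot_wider(names_from = visit, values_from = score, values_fill = 0)

visits_wide   # followup is NA for id 2
visits_filled # followup is 0 for id 2
```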
dplyr - intro
Common structure:
Tip
All verbs in the Tidyverse are designed to do mainly one thing, and to do it well! So, to solve complex problems, we will often combine multiple verbs using the pipe (|>), with which we are already familiar!
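A small sketch of what combining verbs with the pipe can look like. The `people` data frame is made up for illustration, and the three verbs used here are the ones introduced in the rest of this lesson.

```r
library(dplyr)

# Hypothetical data frame used only for illustration.
people <- tibble::tibble(
  id    = 1:5,
  sex   = c("female", "male", "female", "male", "female"),
  age   = c(18, 21, 35, 17, 21),
  group = c("student", "student", "teacher", "student", "teacher")
)

adults <- people |>
  filter(age >= 18) |>           # one verb to keep rows
  select(id, sex, age) |>        # one verb to keep columns
  mutate(age_months = age * 12)  # one verb to add a column

adults
```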
dplyr::filter
Important
dplyr::filter allows you to keep rows based on the values of the columns.
We can use any kind of condition inside dplyr::filter; e.g.,
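A few kinds of conditions, sketched on a made-up `people` data frame (hypothetical data). Note that rows for which the condition evaluates to NA are dropped as well.

```r
library(dplyr)

# Hypothetical data frame; the last age is missing on purpose.
people <- tibble::tibble(
  id    = 1:5,
  sex   = c("female", "male", "female", "male", "female"),
  age   = c(18, 21, 35, 17, NA),
  group = c("student", "student", "teacher", "student", "staff")
)

older     <- people |> filter(age > 20)                         # comparison
females   <- people |> filter(sex == "female")                  # equality uses ==
employees <- people |> filter(group %in% c("teacher", "staff")) # membership
known_age <- people |> filter(!is.na(age))                      # negation

older # the row with a missing age is not kept
```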
We can also combine multiple conditions of arbitrary complexity at once.
Tip
It can be difficult to remember the precedence of logical operators. Using parentheses to group each condition is a safe way to avoid mistakes!
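To see why the parentheses matter, consider this sketch on a made-up `people` data frame (hypothetical data). In R, & is evaluated before |, so the two filters below are not equivalent.

```r
library(dplyr)

# Hypothetical data frame used only for illustration.
people <- tibble::tibble(
  id    = 1:5,
  sex   = c("female", "male", "female", "male", "female"),
  age   = c(18, 21, 35, 17, 21),
  group = c("student", "student", "teacher", "student", "teacher")
)

# Explicit grouping: (young OR teacher) AND female.
grouped <- people |>
  filter((age < 20 | group == "teacher") & sex == "female")

# No parentheses: read as young OR (teacher AND female).
ungrouped <- people |>
  filter(age < 20 | group == "teacher" & sex == "female")

grouped$id   # 1 3 5
ungrouped$id # 1 3 4 5
```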
Your turn
Connect to our pad (https://bit.ly/ubep-rws-pad)
Connect to the Day-3 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio)
…and:
… age, and you want to keep rows with age equal to 18 or 21. Before evaluating it, does the following code return what you need? Answer in the pad, under the section 3.2. Ex20.
Open 12-filter.R, and follow the instructions step by step.
02:00
Important
The input of dplyr::filter is always a data frame.
The output of dplyr::filter is always a data frame.
The input data frame is modified by dplyr::filter, never!
dplyr::select
For analyses, you do not need to remove columns from your dataset, but it can be extremely useful to view only the data you need from time to time.
You can select the columns to keep using the dplyr::select() verb, providing:
var1:var10: variables lying between var1 on the left and var10 on the right.
starts_with("a"): names that start with “a”.
ends_with("z"): names that end with “z”.
contains("b"): names that contain “b”.
matches("x.y"): names that match the regular expression x.y.
num_range("x", 1:4): names following the pattern x1, x2, …, x4.
all_of(vars)/any_of(vars): names stored in the character vector vars. all_of(vars) will error if any of the variables aren’t present; any_of(vars) will match just the variables that exist.
everything(): all variables.
last_col(): furthest column on the right.
where(is.numeric): all variables where is.numeric() returns TRUE.
Tip
!selection: only variables that don’t match selection.
selection1 & selection2: only variables included in both selection1 and selection2.
selection1 | selection2: all variables that match either selection1 or selection2
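Some of the helpers and operators listed above, sketched on a made-up `measures` data frame (the data and column names are hypothetical and chosen only to make the patterns visible).

```r
library(dplyr)

# Hypothetical data frame used only for illustration.
measures <- tibble::tibble(
  id   = 1:3,
  x1   = c(1, 5, 3),
  x2   = c(4, 2, 6),
  x3   = c(7, 8, 9),
  note = c("a", "b", "c")
)

by_range   <- measures |> select(x1:x3)                  # from x1 to x3
by_pattern <- measures |> select(num_range("x", 1:2))    # x1 and x2
not_id     <- measures |> select(!id)                    # everything but id
num_not_id <- measures |> select(where(is.numeric) & !id) # both conditions
either     <- measures |> select(starts_with("x") | ends_with("e")) # any of the two

names(either)
```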
Your turn
Connect to our pad (https://bit.ly/ubep-rws-pad)
Connect to the Day-3 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio)
…and:
Before evaluating it, in the pad, under the section 3.2. Ex21, write (on a new line) all the possible ways you can imagine to select the variables sex, age, and group using dplyr::select from our data frame db imported from Copenhagen_clean.xlsx.
What do you expect the following code to return (possibly an error)?
Open 13-select.R and follow the instructions step by step.
05:00
Important
all_of(vec) is for strict selection. If any of the variables in the character vector vec is missing, an error is thrown.
any_of(vec) doesn’t check for missing variables. It is especially useful with negative selections, when you would like to make sure a variable is removed.
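The difference between all_of and any_of can be sketched as follows. The `measures` data frame and the vector `vars` are hypothetical; `x9` is deliberately a column that does not exist, and tryCatch is used here only so the example keeps running after the error.

```r
library(dplyr)

# Hypothetical data frame used only for illustration.
measures <- tibble::tibble(
  id   = 1:3,
  x1   = c(1, 5, 3),
  x2   = c(4, 2, 6),
  note = c("a", "b", "c")
)

vars <- c("x1", "x9") # "x9" is not a column of measures

# Lenient: silently keeps only the variables that exist.
lenient <- measures |> select(any_of(vars))

# Strict: throws an error because "x9" is missing.
strict <- tryCatch(
  measures |> select(all_of(vars)),
  error = function(e) "error: a requested column does not exist"
)

# Negative selection: make sure these variables are gone, if present.
removed <- measures |> select(!any_of(vars))

names(lenient) # "x1"
names(removed) # "id" "x2" "note"
```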
We can also add new columns which are calculated from existing ones.
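A minimal sketch of adding calculated columns with dplyr::mutate, on a made-up `measures` data frame (hypothetical data). A column created in a mutate call can already be used by the expressions that follow it in the same call.

```r
library(dplyr)

# Hypothetical data frame used only for illustration.
measures <- tibble::tibble(
  id = 1:3,
  x1 = c(1, 5, 3),
  x2 = c(4, 2, 6)
)

with_new <- measures |>
  mutate(
    x_sum    = x1 + x2,    # new column computed from existing ones
    x1_share = x1 / x_sum  # uses the column created just above
  )

with_new
```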
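Warning
Summary functions (e.g., min, max) return a single value, which is repeated on every row.
Vectorized functions (e.g., pmin, pmax) return one value per row.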
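The contrast between min and pmin inside dplyr::mutate can be sketched on a made-up `measures` data frame (hypothetical data): min looks at all the values of both columns at once, while pmin compares them row by row.

```r
library(dplyr)

# Hypothetical data frame used only for illustration.
measures <- tibble::tibble(
  id = 1:3,
  x1 = c(1, 5, 3),
  x2 = c(4, 2, 6)
)

compared <- measures |>
  mutate(
    overall_min = min(x1, x2), # one value for the whole table, repeated
    row_min     = pmin(x1, x2) # one value for each row
  )

compared$overall_min # 1 1 1
compared$row_min     # 1 2 3
```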
Your turn
Connect to our pad (https://bit.ly/ubep-rws-pad)
Connect to the Day-3 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio)
…and:
In the pad, under the section 3.2. Ex22, write your guess about the output of dplyr::mutate when the new column is assigned the name of an already existing variable. E.g.
Open 14-mutate.R and follow the instructions step by step.
02:00
Important
Like all the other verbs in the Tidyverse, dplyr::mutate:
It takes a data frame as input, always.
It returns a data frame as output, always.
{stringr}
Instructions
Your turn
homework/day_three-summative.html
homework/solution.R
To create the current lesson, we explored, used, and adapted content from the following resources:
The slides are made using Posit’s Quarto open-source scientific and technical publishing system, powered in R by Yihui Xie’s knitr.
This work by Corrado Lanera, Ileana Baldi, and Dario Gregori is licensed under CC BY 4.0

UBEP’s R training for supervisors